Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework2
Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
We will be working with the Penguin dataset again as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.
Using the target variable, Species, please conduct:
You want to evaluate all the ‘features’ or dependent variables and see what should be in your model. Please comment on your choices.
Just a suggestion: You might want to consider exploring featurePlot on the caret package. Basically, you look at each of the features/dependent variables and see how they are different based on species. Simply eye-balling this might give you an idea about which would be strong ‘classifiers’ (aka predictors).
Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
Look at the fit statistics/ accuracy rates.
# Load dataframe
df <- penguins %>%
as_tibble()
# drop the year column
df <- df %>%
dplyr::select(-year)
# show summary infio
skim(df)| Name | df |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Our data set has the following independent features:
island (Biscoe, Dream or TorgerSen)bill_length_mm and bill_depth_mmflipper_length_mmbody_weight_gsexyear … I assume this is the year the penguin was measuredOur Dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.
Per instructions, we will drop the year variable. In theory, if we expected penguin characteristics to be changing over time (weather, access to food, ecological problems, etc), then year would be a potentially useful feature. If we assume there are no dramatic time dependent shifts taking place, then year probably doesn’t provide any additional resolution toward prediction.
We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.
# show missing data
gg_miss_upset(df)We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leave 9 records missing the sex feature. Since sex might be meaningful we can either KNN impute or drop those rows. Since we only have a few rows, I am going to drop them for convenience. If there had been a larger number then we would have to consider whether to drop the feature. If we chose to impute, the best strategy would be KNN imputing based on other row features; however, while imputing is a “go to” strategy, it does in fact carry the risk of biasing our data by reducing variance. All impute techniques run this risk - when we fill in values with mean, median, or imputed, we are artificially weighing our data. This can work in our favor by, but can also work against us. As such, the decision to impute should be explored in more depth given the specific problem.
# remove rows with NAs
df <- na.omit(df)Our island feature and species outcome are categorical - for machine learning, we really need everything coded as numeric. I will do a simple dummy encoding for island which will generate 3 new binary columns, one for each island. Note, dummy coding can be done with or without an intercept. I’ll skip so we will have 3 columns, not 2. For species prediction, we prefer to only have a single column. A typical strategy is to code with a numerical substitute for character values. Since we have 3 outcomes, we can map these to 1, 2 and 3.
# convert species to a factor
species <- as.factor(df$species)
# dummy encode our categorical `island`
dummies <- dummyVars(species ~ ., drop2nd=TRUE, data = df)
df <- data.frame(predict(dummies, newdata = df))## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor
df$species <- species
skim(df)| Name | df |
| Number of rows | 333 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| island.Biscoe | 0 | 1 | 0.49 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
| island.Dream | 0 | 1 | 0.37 | 0.48 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▅ |
| island.Torgersen | 0 | 1 | 0.14 | 0.35 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▁ |
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
| sex.female | 0 | 1 | 0.50 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
| sex.male | 0 | 1 | 0.50 | 0.50 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
rm(dummies)
rm(species)Between feature correlations are a huge problem for most regressions. There are better techniques for which multicollinearity is not an issue. The problem with multicollinearity is that our algorithms cannot tease out which variables are contributing to the outcome when they are correlated. When we see correlated features, we need to arbitrarily remove some or reach for a technique like PCA which transforms the features into a new dimensional space with complete independence. The problem with PCA is that we loose all explainibility of the features in our model. Typically we do a forward or reverse step-wise approach to select the minimal set of features which provide the best model performance.
# Calculate and plot the Multicollinearity
correlation = cor(df[,-10], use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))rm(correlation)# Show feature correlations/target by decreasing correlation
df$speciesNumeric <- as.numeric(df$species)
stack(sort(cor(df[,1:9], df[, 11])[,], decreasing=TRUE))## values ind
## 1 0.85073678 flipper_length_mm
## 2 0.75043358 body_mass_g
## 3 0.73054767 bill_length_mm
## 4 0.59652777 island.Biscoe
## 5 0.01096361 sex.male
## 6 -0.01096361 sex.female
## 7 -0.31536361 island.Dream
## 8 -0.41931810 island.Torgersen
## 9 -0.74034609 bill_depth_mm
We see some strong correlations between the features - this multicollinearity will be an issues with any standard linear regression approaches. If we are wanting the best model possible, we might need to consider an approach like PCA to reframe our features into a set that are independent. We also notice in this plot that bill_length_mm, body_mass_g, and flipper_length_mm are more strongly correlated. We also see these features positively correlated with both Biscoe Island and being male. Next, we see that flipper_length_mm, body_mass-g, bill_length_mm, and island.Biscoe have the highest correlations with the target species.
For regressions, we need to ensure there is a direct relationship between changes in each feature and our target outcome. Ideally, there should be a linear relationship. We can employ transformations, BoxCox, etc to help tease variables which don’t show a linear relationship.
Now lets explore any correlations between features and with the target. Note I separate Male and Female into separate chart.
males <- df %>%
dplyr::filter(sex.male == 1) %>%
dplyr::select(-c(sex.male, sex.female))
featurePlot(x=males[,1:7], y=males[,8], plot="pairs", auto.key=list(columns=3))females <- df %>%
dplyr::filter(sex.female == 1) %>%
dplyr::select(-c(sex.male, sex.female))
featurePlot(x=females[,1:7], y=females[,8], plot="pairs", auto.key=list(columns=3))rm(males)
rm(females)We were not given a separate hold out group - so to better measure model performance against unseen samples, I’ll set aside 20% of our initial data as a holdout Test group. While this may slightly lower our training performance (few samples means less information captured by the model), the trade-off is that we can identify under- and over-fitting and better measure model performance.
# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df$species, p = .8, list = FALSE, times = 1)
pengTrain <- df[ trainIndex,]
pengTest <- df[-trainIndex,]
# Cross validation setup for caret modeling
ctrl <- trainControl(method = 'cv',
number = 10,
classProbs = T,
savePredictions = TRUE)We train our model on the Training Data only, assess performance, then separately predict from the holdout test set and use that for final model performance.
I only included a subset of features based on EDA above: island.Biscoe, bill_length_mm,bill_depth_mm,flipper_length_mm, andbody_mass_g`.
library(pROC)## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
lda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="lda",
trControl = ctrl
)
summary(lda_model)## Length Class Mode
## prior 3 -none- numeric
## counts 3 -none- numeric
## means 15 -none- numeric
## scaling 10 -none- numeric
## lev 3 -none- character
## svd 2 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 3 -none- character
## param 0 -none- list
varImp(lda_model)## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(lda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 116 2 0
## Chinstrap 1 53 0
## Gentoo 0 0 96
##
## Overall Statistics
##
## Accuracy : 0.9888
## 95% CI : (0.9676, 0.9977)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9825
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9915 0.9636 1.0000
## Specificity 0.9868 0.9953 1.0000
## Pos Pred Value 0.9831 0.9815 1.0000
## Neg Pred Value 0.9933 0.9907 1.0000
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4328 0.1978 0.3582
## Detection Prevalence 0.4403 0.2015 0.3582
## Balanced Accuracy 0.9891 0.9795 1.0000
test_pred <- as.factor(predict(lda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(lda_values <- defaultSummary(modelValues))## Accuracy Kappa
## 0.9846154 0.9759437
Note, I had problems running straight qda() within caret and train(). So, I went ahead and used stepDQA which applies a stepwise feature selection and evaluation using QDA.
qda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="stepQDA",
trControl = ctrl,
metric = "Accuracy"
)## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82617; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94617; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.158
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81333; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.9585; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.148
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82167; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.9585; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.147
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81733; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94583; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.148
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81717; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95417; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.146
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95033; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.145
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.8175; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95883; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.143
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95033; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.142
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82583; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.137
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 240 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.825; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95417; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.137
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 268 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81752; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94786; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.139
summary(qda_model)## Length Class Mode
## call 6 -none- call
## method 1 -none- character
## start.variables 0 -none- NULL
## process 4 data.frame list
## model 2 data.frame list
## result.pm 2 -none- numeric
## runtime 3 -none- numeric
## performance.measure 1 -none- character
## formula 3 formula call
## fit 8 qda list
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 3 -none- character
## param 0 -none- list
varImp(qda_model)## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(qda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 112 5 0
## Chinstrap 3 48 1
## Gentoo 2 2 95
##
## Overall Statistics
##
## Accuracy : 0.9515
## 95% CI : (0.9185, 0.9739)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9239
##
## Mcnemar's Test P-Value : 0.418
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9573 0.8727 0.9896
## Specificity 0.9669 0.9812 0.9767
## Pos Pred Value 0.9573 0.9231 0.9596
## Neg Pred Value 0.9669 0.9676 0.9941
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4179 0.1791 0.3545
## Detection Prevalence 0.4366 0.1940 0.3694
## Balanced Accuracy 0.9621 0.9270 0.9832
test_pred <- as.factor(predict(qda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 0 12 1
## Gentoo 1 0 22
##
## Overall Statistics
##
## Accuracy : 0.9538
## 95% CI : (0.871, 0.9904)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9274
##
## Mcnemar's Test P-Value : 0.3916
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 0.9565
## Specificity 0.9722 0.9808 0.9762
## Pos Pred Value 0.9655 0.9231 0.9565
## Neg Pred Value 0.9722 0.9808 0.9762
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3385
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 0.9664
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(qda_values <- defaultSummary(modelValues))## Accuracy Kappa
## 0.9538462 0.9274013
nb_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="nb",
rControl = ctrl
)
summary(nb_model)## Length Class Mode
## apriori 3 table numeric
## tables 5 -none- list
## levels 3 -none- character
## call 7 -none- call
## x 5 data.frame list
## usekernel 1 -none- logical
## varnames 5 -none- character
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 3 -none- character
## param 1 -none- list
varImp(nb_model)## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(nb_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 116 5 0
## Chinstrap 1 50 0
## Gentoo 0 0 96
##
## Overall Statistics
##
## Accuracy : 0.9776
## 95% CI : (0.9519, 0.9917)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9648
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9915 0.9091 1.0000
## Specificity 0.9669 0.9953 1.0000
## Pos Pred Value 0.9587 0.9804 1.0000
## Neg Pred Value 0.9932 0.9770 1.0000
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4328 0.1866 0.3582
## Detection Prevalence 0.4515 0.1903 0.3582
## Balanced Accuracy 0.9792 0.9522 1.0000
test_pred <- as.factor(predict(nb_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 3 0
## Chinstrap 1 10 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9385
## 95% CI : (0.8499, 0.983)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.902
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.7692 1.0000
## Specificity 0.9167 0.9808 1.0000
## Pos Pred Value 0.9032 0.9091 1.0000
## Neg Pred Value 0.9706 0.9444 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1538 0.3538
## Detection Prevalence 0.4769 0.1692 0.3538
## Balanced Accuracy 0.9411 0.8750 1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(nb_values <- defaultSummary(modelValues))## Accuracy Kappa
## 0.9384615 0.9020347
(models_summary <- rbind(lda_values, qda_values, nb_values))## Accuracy Kappa
## lda_values 0.9846154 0.9759437
## qda_values 0.9538462 0.9274013
## nb_values 0.9384615 0.9020347
All three models performed extremely well with LDA doing best. Given how well these performed, it might be worth learning these modeling techniques with a different data set where we do see differences and can relate those to how the model work. I would note that the classes had fairly good separation which problably led to good LDA performance. Naive Bayes makes an assumption of independence of features which might explain it’s poorer performance. We didn’t do anything to handle multicolinearity. These models performed sufficiently that I did not go back and do further feature cleanup. If model performance had been lower, I would have considered normalizing my features (center, scale. and potentially BoxCox), but that just wasn’t necessary.